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ABSTRACT 

In order to associate complex traits with genetic polymor- 
phisms, genome-wide association studies process huge data- 
sets involving tens of thousands of individuals genotyped for 
millions of polymorphisms. When handling these datasets, 
which exceed the main memory of contemporary computers, 
one faces two distinct challenges: 1) Millions of polymor- 
phisms come at the cost of hundreds of Gigabytes of geno- 
type data, which can only be kept in secondary storage; 2) 
the relatedness of the test population is represented by a 
covariance matrix, which, for large populations, can only fit 
in the combined main memory of a distributed architecture. 
In this paper, we present solutions for both challenges: The 
genotype data is streamed from and to secondary storage us- 
ing a double buffering technique, while the covariance matrix 
is kept across the main memory of a distributed memory sys- 
tem. We show that these methods sustain high-performance 
and allow the analysis of enormous datasets. 

Keywords 

genome-wide association study, mixed-models, generalized 
least squares, out-of-core, distributed memory, Elemental 

1. INTRODUCTION 

Whole Genome Association Studies, also known as Genome- 
Wide Association (GWA) studies, became the tool of choice 
for the identification of loci associated with complex traits. 
The association between a trait of interest and genetic poly- 
morphisms (usually single nucleotide polymorphisms, SNPs) 
is studied using thousands of people typed for hundreds of 
thousands of polymorphisms. Thanks to these studies, hun- 
dreds of loci for dozens of complex human diseases and quan- 
titative traits have been discovered [6]. In GWA analysis, 
one of the most used methods to account for the genetic sub- 
structure due to relatedness and population stratification is 



the variance component approach based on mixed models [2, 
11]. While effective, mixed-models based methods are com- 
putationally demanding both in terms of data management 
and computation. The objective of this research is to make 
large-scale GWA analyses more affordable. 

Computationally, a GWA analysis based on approximations 
to the mixed-model applied to a set of n individuals and 
m genetic markers (SNPs) boils down to the solution of m 
generalized least-squares (GLS) problems 



h := (XTM~ 1 X 1 ) 



' X XfM' 



y, 



with 



(1) 



where the Xi G K' ixp is the design matrix, M G R nxn is 
the covariance matrix, the vector y G R n contains the phe- 
notypes, and the vector bi G W expresses the relation be- 
tween a variation in the SNP (Xi) and a variation in the 
trait (y). Additionally, M is symmetric positive definite 
(SPD), 2 < p < 20, n ranges approximately between 10 
and 10 5 , and m ranges between 10 5 and 10 8 . Finally, Xi 
is full rank and can be viewed as composed of two parts, 



X % = (X L \X m ), with X L G 



»nX (p- 



J) and X Ri G 



where Xl is constant across all m genetic markers. 

The first reported GWA study dates back to 2005: 146 in- 
dividuals were genotyped, and about 103,000 SNPs were 
analyzed [7]. Since then, as the catalog of publishd GWA 
analyses shows [5, 9], the number of published studies has 
increased steadily, up to 2,404 in 2011 and 3,307 in 2012. A 
similar growth can be observed both in the population size 
and in the number of SNPs: Across all the GWAS published 
in 2012, on average, the studies used 15,471 individuals, with 
a maximum of 133,154, and 1,252,222 genetic markers, with 
a maximum of 7,422,970. From the perspective of Eq. (1), 
these trends present concrete challenges, especially in terms 
of memory requirements. As both M G K nxn and the AVs 
compete for the main memory, two scenarios arise: 1) If 
n is small enough, M fits in memory, and the Xi's have 
to be streamed from disk; 2) if M does not fit in memory, 
then both data and computation have to be distributed over 
multiple compute nodes. In this paper we present efficient 
algorithms for both scenarios. 

Several notable implementations for GWA studies already 
exist: GenABEL is a widely spread library for genome stud- 
ies [1]; FaST-LMM is a program specifically designed for 



large datasets [8]; recently, a new high-performance imple- 
mentation, to which we refer as SMP-OOC, was introduced 
in [3]. None of these algortihms are meant for distributed 
memory systems; hence, for all of them the population size 
n is limited by the memory of a single node. 

The rest of this paper is structured as follows. A mathemat- 
ical algorithm used to solve Eq. (1) is discussed in Section 2. 
Then, a technique to make the algorithm feasible for an arbi- 
trary numbers of SNPs out-of-core is presented in Section 3. 
Finally, in Section 4, the algorithm is further extended to 
deal with large population sizes by means of distributed 
memory architectures. 

2. THE MATHEMATICAL ALGORITHM 

The standard route to solving one of the GLS's in Eq. (1) is 
to reduce it to an ordinary least squares problem (OLS), 



h 


= (X^Xi) \ 


through the operations 




l LL T ■- M 


(Cholesky factorization 


2 Xi := L Xi 


(triangular solve) 


3 V ■■= L _1 y 


(triangular solve) 



The resulting OLS can then be solved by two alternative 
approaches, respectively based on the QR decomposition of 
Xi, and the Cholesky decomposition of X t Xi. In general, 
the QR-based method is numerically more stable; however, 
in this specific application, since X t Xi £ W xp is very small 
and X is typically well conditioned, both approaches are 
equally accurate. In terms of performance, the solution via 
Cholesky decomposition (detailed below) is slightly more ef- 
ficient. 



4 Si ■— Xi Xi 

5 6, := Xi y 

6 h := S'% 



(symmetric matrix product) 
(matrix times vector) 
(linear system via Cholesky) 



In this paper, we only consider this approach. 

2.1 Multiple SNPs 

When the six steps for the solution of one OLS are applied 
to the specific case of Eq. (1), it is possible to take advantage 
of the structure of Xi and avoid redundant computation. 



Plugging Xi 
obtain 



(X L \X m ) into Xi :- L" 1 Xi (line 2), we 
\XL\XRij ■— [L Xl\L Xm), 



that is, Xl '■= L~ x Xl, and Xm :— L~ 1 Xm. These assign- 
ments indicate that the quantity Xl can be computed once 
and reused across all the SNPs. 



Similarly, for Si ■— Xi Xi (line 4), we have 1 

Stl 



SbLi 



SsRi 



X l Xl 



Xr^Xl 



X RiX 



Ri^-Ri 



1 The subscript letters l, r, t, and b stand for Left, Right, 
Top, and Bottom, respectively. 
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Xl 


= L- l X L , y-L-^y 
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Stl 


:= XIX L , b T := X T L y 
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for 


i in {1, . . . ,m} 


5 




XRi := L~ Xri 


6 




SsLi -~ X RiX L 


7 




SBRi '■— X Ri X Ri 


8 




b B i ■= X Ri y 


9 




Tt ?. ( STL 


* N 




\ OBLi 


SBRi j 



b, 



10 

li 



bi ■— S bi 



end 



Algorithm 1: Optimized algorithm for the solution 
of Eq. (1). 



from which 




Stl '■ 


= XiA L GR (p - 1)x(p - 1) 


SBLi '■ 


= X^xleR 1 *^, a 


SBRi '■ 


7 1 T 

- X Ri X Ri £ R, 



and 



indicating that Stl, the top left portion of Si, is indepen- 
dent of i and needs to be computed only once. 2 Finally, the 
same idea applies also to bi (line 5), yielding the assignments 

b T ■= X L y and b B i ■— X Ri y. 

The computation for the whole Eq. (1) is given in Algorithm 1. 
There, all the operations independent of i are moved out- 
side the loop, thus lowering the overall complexity from 
0(n 3 + mn 2 p) down to 0{n 3 + mn 2 ). 3 This algorithm con- 
stitutes the basis for the large-scale versions presented in the 
next two sections. 

3. OUT-OF-CORE 

GWA studies often operate on and generate datasets that 
exceed the main memory capacity of current computers. 
For instance, a study with n = 20,000 individuals, m = 
10,000,000 SNPs, and p = 4, requires 1.49 TB to store the 
input data (M and Xi's), and generates 305 MB of out- 
put. 4 To make large analyses feasible, regardless of the 
number of SNPs, Fabregat et al. proposed an extended ver- 
sion of Algorithm 1, described below, that streams XRi and 
bi from secondary storage, by means of asynchronous I/O 
operations [3]. 

In order to avoid any overhead, the vectors XRi (and bi) are 
grouped into blocks Xuk (and btik) of size mtik, and read 
(written) asynchronously using double buffering. The idea is 
to logically split the main memory in two equal regions: One 



2 Since Si is symmetric, its top-right and bottom-left quad- 
rants are the transpose of each other; we mark the top-right 
quadrant with a *, indicating that it is never accessed nor 
computed. 

3 Since in most analyses m^> n, the complexity reduces by 
a factor of p, from 0(mn 2 p) down to 0(mn 2 ). 
4 In practice the size of the output is even larger, because in 
addition to bi, a p x p symmetric matrix is also generated. 



i LL 1 := M 



2 X L ■= L~ X X L , y := L" 1 



2/ 

3 Stl := X L X L , b T := X L y 

4 load_start first Xuk 

5 for each blk 

6 load_wait current Xuk 

7 if not last blk : load_start next Xuk 

8 Xblk '■= L Xblk 

for i in {1, . . . , muk} 

set Xm ■— X b ik[i] 



OBLi '— XjuXl 



b Bi :- X m y 
set Si '■= 



JTL 



SsLi 

bi := S^k 
set bukU] '■— bi 



SBRi '— XjiiXni 



h:= 



Sbi 



14 
15 
16 
17 

18 

19 end 

20 store_wait last buk 



end 

if not first blk : store_wait previous buk 
store_start current buk 



Algorithm 2: Out-of-core version of Algorithm 1: 
The Xm and bi are streamed from and to disk in 
blocks. Asynchronous I/O operations are in green. 



region is devoted to the block of data that is currently pro- 
cessed, while the other is used to store the output from the 
previous block and to load the input for the next one. Once 
the computation on the current block is completed, the roles 
of the two regions are swapped. The algorithm commences 
by loading the first block of SNPs Xuk from disk into mem- 
ory; then, while the GLS's corresponding to this block are 
solved, the next block of SNPs is loaded asynchronously in 
the second memory region. (Analogously, the previous bbik 
is stored, while the current one is computed.) 

When dealing with large analyses, an important optimiza- 
tion comes from, whenever possible, processing multiple 
SNPs at once: Algorithm 2 shows how slow vector opera- 
tions on Xm can be combined together, originating efficient 
matrix operations on X M k £ R" xm " fc . 

3.1 Shared memory implementation 

The implementation of Algorithm 2, called SMP-OOC, makes 
use of parallelism in two different ways [3] . The operations 
in lines 1 through 8 are dominated by BLAS3 and take 
full advantage of a multithreaded implementation of BLAS 
(LAPACK). By contrast, the operations within the inner- 
most loop (lines 11 through 14), only involve very small or 
thin matrices, for which BLAS and especially multithreaded 
BLAS are less efficient. Therefore, thery are scheduled in 
parallel using OpenMP in combination with single-threaded 
BLAS and LAPACK. 

We compiled SMP-OOC, written in C, with the GNU 
C compiler (version 4.4.5) and linked to Intel's Math Ker- 
nel Library (MKL version 10.3). All tests were executed on 
a system consisting of two six-core Intel X5675 processors, 
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Figure 1: Performance of SMP-IC and SMP-OOC 
for increasing m. The vertical line is the limit for 
the in-core version imposed by the RAM size, n = 
10,000, p = 4. 



1 week 



1 day 



1 hour 



X56.8 



X6.3 




3.6- 10' 



Figure 2: Performance of SMP-OOC compared to 
GenABEL and FaST-LMM. n = 10,000, p = 4. 



running at 3.06 GHz, equipped with 32 GB of RAM, and 
connected to a 1 TB hard disk. 

Preliminary measurements have shown that changing p £ 
{1, . . . , 20} results in performance variation on the order of 
system fluctuations (below 1%). We therefore consider p = 
4, a value encountered in several GWA studies, throughout 
all our experiments. 

In the first experiment, we compare the efficiency of SMP- 
OOC with SMP-IC, an equivalent in-core version. We fixed 
n = 10,000, p = 4, and we let m vary between 10 3 and 10 7 . 
For the out-of-core version, the SNPs were grouped in blocks 
of size rribik = 5,000. Figure 1 shows that SMP-OOC scales 
linearly in the number of SNPs, well beyond the maximum 
problem size imposed by the 32 GB of RAM. Furthermore, 
the fact that the lines for the in-core and out-of-core algo- 
rithms overlap perfectly confirms that the I/O operation to 
and from disk are entirely hidden by computation. 

In the second experiment, Figure 2, we show the perfor- 
mance of SMP-OOC with respect to that of other two sol- 
vers: FaST-LMM, a program designed for GWAS on large 
datasets [8] and GenABEL, a widely spread library for 
genome studies [1]. Again, we fixed n = 10,000 and p = 4, 
while m varies between 10 6 and 3.6 ■ 10 7 . The fairly con- 
stant observed speedups of SMP-OOC over FaST-LMM 
and GenABEL are, at m = 3.6 • 10 7 , 6.3 and 56.8, respec- 
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Figure 3: Default 2D matrix distribution on a 2 x 3 
process grid. 
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tively. 

4. DISTRIBUTED MEMORY 

While SMP-OOC scales up to an arbitrarily large amount 
of SNPs m, the main memory is still a limiting factor for 
the population size n: In fact, the algorithm necessitates 
the matrix M G R nxn (or equivalently, its Cholesky factor 
L) to reside fully in memory. Due to the triangular solve 
(Algorithm 2, line 2), keeping the matrix in the secondary 
storage is not a viable option. Our approach here consists 
in distributing M, L, and all matrices on which L oper- 
ates, across multiple compute nodes, lifting any constraint 
on their size. 

4.1 Elemental 

As a framework for distributed-memory dense linear algebra 
operations, we use Elemental [10]. This is a C++ library, 
based on the Message Passing Interface (MPI), that operates 
on a virtual two-dimensional grid of processes; the name is 
inspired by the fact that, in general, matrices are cyclically 
distributed across a 2D grid of processes in an element-wise 
fashion. This principal distribution 5 is shown in Figure 3; 

Algebraic operations on distributed matrices involve two 
stages: data redistribution (communication), and invocation 
of single-node BLAS or LAPACK routines (computation). 
Optimal performance is attained by minimizing communi- 
cation within the redistributions. In most cases, as shown 
in [10], this is achieved by choosing the process grid to be 
as close to a perfect square as possible. 

While a square process grid is optimal for performance, since 
all processes only hold non-contiguous portions of the ma- 
trix, it complicates loading contiguously stored data from 
files into a distributed matrix. In the context of GWAS, 
the algorithm has to load two objects of different nature: 
the matrix M, and the collections of vectors Xuk\ the spe- 
cial nature of the latter determines that the vectors can be 
loaded and processed in any order. 

For loading M, we first read contiguous panels into the lo- 
cal memory of each process via standard file operations, and 
then construct the global (distributed) version of M by ac- 
cumulating the panels. This is done via Elemental's axpy- 
interface, a feature that makes it possible to add node-local 
matrices to a global one. 

For loading X^ik instead, a collection of contiguously stored 
vectors is read into memory through more efficient means 



Figure 4: ID matrix distribution on a 1 x 6 process 
grid. 



than the axpy- interface by exploiting that, as long as con- 
sistently handled, the order of the vectors is irrelevant. The 
trick is to use a matrix that is distributed on a virtual ID 
reordering of the grid into a row of processes. As shown in 
Figure 4, the process-local data of such a matrix is a set of 
full columns, which can be loaded from a contiguous data- 
file. While these local columns are not adjacent in the dis- 
tributed matrix, Elemental guarantees that all algebraic 
operations performed on them maintain their order. For 
performance reasons, prior to any computation, the matrix 
on the ID ordering of this grid needs to be redistributed 
to conform to the initial 2D process grid (Figure 3). This 
redistribution, provided by Elemental, can internally be 
performed most efficiently through a single MPI_Alltoall if 
the ID grid is the concatenation of the rows of the 2D grid. 6 

4.2 The parallel algorithm 

In Algorithm 3, we present the distributed-memory version 
of Algorithm 2 for np processes; the matrices that are dis- 
tributed among the processes and the corresponding opera- 
tions are highlighted in blue; the quantities that differ from 
one process to another are instead in red. 

The algorithm begins (line 1) by loading the first rnblk vec- 
tors Xm into a local block Xuk on each process asynchro- 
nously. It commences with the initially distributed M, Xl, 
and y, and computes L, Xl, and y (lines 2-3). Then, Xl 
and y, local copies of Xl and y, respectively, are created on 
each process (line 4). Since small local computations are sig- 
nificantly more efficient than the distributed counterparts, 
Stl and &t are computed redundantly by all processes (line 
5). 

In line 9, the asynchronously loaded blocks Xuk are - 
without any communication or memory transfers — seen as 
the columns of Xuk that are cyclically distributed across a 
ID process grid as described in Section 4.1. Since in Ele- 
mental matrix operations require all operands to be in the 
default distribution across the 2D grid, Xuk and Xuk are 
redistributed before and after the computation in line 10, 
respectively. Once X uk is computed and redistributed, in 
line 11, each process views its local columns of this matrix as 
Xf,ik', since the distributions of Xuk and Xuk are identical, 
these — without communication or transfers — correspond 
to the columns of Xuk- 



Tn Elemental's notation: [MC, MR]. 



Tn Elemental: [*, VR]. 



1 load_start first Xuk 

2 LL T := M 

3 X L :=£ _1 *£j_ y-L^y 

4 copy X L := X L , y :=y 

5 Stl ■— X l Xl , br ■— X L y 

6 for each blk 

7 load_wait current Xuk 

8 if not last blk: load_start next Xuk 
set X U k ■— combine (A Wfe ) 
Xuk '■= L Xuk 
set Xuk '■= localpart (X wfc ) 

Sblk '■= Xtlk Xl 

for i in {l,....an} 
set Xfli := A' wfc [i] 
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10 

n 

12 
13 

14 
15 
16 
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Sbli — Suk[i] 



SBRi 
bsi '■- 

set Si 



Xm Xm 

Stl 



Xm T y 



b z := S^h 
set bwfcfi] 



18 
19 
20 
21 
22 

23 end 

24 store_wait last buk 



Sshi 

■bi 



Sbi 



b, 



l>T 



end 

if not first blk : store_wait previous buk 
store_start current bbik 
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Figure 5: Performance of Elem-OOC as a function 
of 77i. Here, n = 40,000, p = 4, and m ranges from 
2,048 to 8.2 ■ 10 6 . The vertical lines are limits for a 
theoretical in-core version of the parallel algorithm 
imposed by the accumulated RAM sizes. 
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Algorithm 3: Distributed memory version of 
Algorithm 1. Asynchronous I/O operations are de- 
picted green, distributed matrices and operations in 
blue, and quantities that differ across processes in 
red. 



In addition to blocking Xm and bs% , the computation of all 
row vectors Sbl% belonging to the current block is combined 
into a single matrix product (line 12) resulting in the Sblz 
being stacked in a block Sbik- In line 14, Sbu is selected 
from Sbik, along with Xm from Xuk for the innermost loop. 
This loop then computes the local buk independently on 
each process. Finally, buk (whose columns bi corresponds 
to the initially loaded vectors Xm within Xuk) is stored 
asynchronously, while the next iteration commences. 

4.3 Performance Results 

We compile Elem-OOC, the C++-implementation of Algorithm 
with the GNU C compiler (version 4.7.2), use Elemental 
(version 0.78-dev) with OpenMPI (version 1.6.4) and link 
to Intel's Math Kernel Library (MKL version 11.0). In our 
tests, we use a compute cluster with 40 nodes, each equipped 
with 16 GB of RAM and two quad-core Intel Harpertown 
E5450 processors running at 3.00 Ghz. The nodes are con- 
nected via InfiniBand and access a high speed Lustre file 
system. 











- 


np — 16 

np — 32 

np — 64 


64 GB 










32 GB / 




16 GB y : / 






, — - — -^ i : i 





20.000 



40,000 



60,000 



80,000 



10° 



Throughout all our experiments, we use the empirically op- 

256 by choosing rribik = 256np. 



timal local block-size mblk 

np 



4.3.1 Processing huge numbers ofSNPs out-of-core 

Since Elem-OOC incorporates the double-buffering method 
introduced in Section 3, it can process datasets with arbi- 



Figure 6: Performance of Elem-OOC as a function 
of n. p = 4, 77i = 65,536, and n ranges from 5,000 
to 100,000. The vertical lines indicate the limits im- 
posed by the accumulated RAM sizes. 



trarily large m without introducing any overhead due to I/O 
operations. To confirm this claim, we perform a series of ex- 
periments, using np — 8, 16, 32, and 64 cores (1, 2, 4, and 8 
nodes) to solve a system of size n = 40,000 and p — 4 with 
increasing dataset size m. The performance of these experi- 
ments is presented in Figure 5, where the vertical lines mark 
the points at which the 16 GB of RAM per node are insuf- 
ficient to store all m vectors Xm- The plot shows a very 
gsmooth behavior with m (dominated by the triangular solve 
in Algorithm 3, line 10) well beyond this in-core memory 
limit. 

4.3.2 Increasing the population size n 

We now turn to the main goal of our effort: performing 
computations on systems whose matrix M G R nxn exceeds 
the capacity of the main memory. For this purpose, we use 
777 = 65,536, p — 4 and execute Elem-OOC on np = 8, 
16, 32, and 64 cores (1, 2, 4, and 8 nodes) with increasing 
matrix size n. Figure 6 reports the performance of these 
executions, which is dominated by the cubic complexity of 
the Cholesky factorization of M (Algorithm 3, line 2). The 
vertical lines indicate where the nodes' memory would be 
exceeded by the size of the distributed M and the buffers 
for Xbik- The plot shows that our implementation succeeds 
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Figure 7: Performance of Elem-OOC as a function 

of np. p — 4, and np ranges from 8 to 64. 



in overcoming these memory limitations through increasing 
the number of nodes. 

4.3.3 Strong scalability 

In practice, the problem sizes are bound to the specific 
GWAS and the interest lies in solving Eq. (1) as fast as 
possible. In the following experiment, we investigate how 
the time to solution is reduced by Elem-OOC through in- 
creasing the number of compute units, while keeping the 
problem size constant. In Figure 7, we present the perfor- 
mance attained for four different problem sizes with 8 up to 
64 cores (1 through 8 nodes). It shows perfect scalability for 
increasing the number of processes from 8 to 16, reducing 
the runtime by a factor of 2. With even more processes, 
the parallel efficiency decreases, since the local portions of 
L become too small, but execution time is reduced further. 

5. CONCLUSION 

We presented two parallel algorithms for solving the general- 
ized least squares problems that arise in genome-wide asso- 
ciation studies (GWAS). They address the issue of growing 
dataset sizes due to the number of studied polymorphisms 
m and/or the population size n. 

The first algorithm uses a double buffering technique in or- 
der to process datasets with arbitrarily large numbers of 
genetic polymorphisms. Compared to other wide-spread 
GWAS-codes, this algorithm's shared memory implementa- 
tion, SMP-OOC, was shown to be at least one order of 
magnitude faster. 

The second algorithm enables the processing of datasets in- 
volving large populations by storing the covariance matrix 
in the combined main memory of distributed memory archi- 
tectures. Elem-OOC, the implementation of this algorithm, 
was shown to scale very well in both the population size and 
the number of processes used. 

Together, these two algorithms form a viable basis for the 
challenges posed by the scale of current and future genome- 
wise association studies. 

5.1 Future Work 

The work presented in this paper can be extended in several 
ways. 



Hybrid parallelism, i.e., using multithreaded BLAS 
and LAPACK, as well as OpenMP, offers further po- 
tential to boost the performance and efficiency of our 
distributed memory implementation Elem-OOC. 

When a GWAS is interested in more than one trait y, 
a further dimension j is added to the set of generalized 
least squares problems in Eq. (1): 



bij = (XfM^Xi 



1 XlM~ 1 y j 



with i — l,...,m and j — 1, . . . ,t. A highly effi- 
cient shared memory implementation for this problem 
is already presented in [4]; only a distributed memory 
implementation on the lines of Elem-OOC would be 
capable of solving this problem for large population 
sizes. 

• Since the covariance matrix M represents the related- 
ness of a diverse population, its few significant entries 
can be grouped close to the diagonal. This allows to 
significantly reduce computation time by operating on 
banded matrices. 
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